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Abstract. We consider the problem of designing a sparse Gaussian pro- 
cess classifier (SGPC) that generalizes well. Viewing SGPC design as 
constructing an additive model like in boosting, we present an efficient 
and effective SGPC design method to perform a stage-wise optimization 
of a predictive loss function. We introduce new methods for two key com- 
ponents viz., site parameter estimation and basis vector selection in any 
SGPC design. The proposed adaptive sampling based basis vector selec- 
tion method aids in achieving improved generalization performance at a 
reduced computational cost. This method can also be used in conjunc- 
tion with any other site parameter estimation methods. It has similar 
computational and storage complexities as the well-known information 
vector machine and is suitable for large datasets. The hyperparameters 
can be determined by optimizing a predictive loss function. The exper- 
imental results show better generalization performance of the proposed 
basis vector selection method on several benchmark datasets, particu- 
larly for relatively smaller basis vector set sizes or on difficult datasets. 

Key words: Gaussian process, Classification, Sparse models, Additive 
models 
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1 Introduction 

Sparse Gaussian Process (GP) classifier design aims at addressing the issues of 
high computational and storage costs associated with learning a full model GP 
(0(n 3 ) and 0{n 2 ) respectively) [6] using n training examples, and involves using 
a representative data set, called the basis vector set, from the input space. In this 
way, the computational and memory requirements are reduced to Olnd^^) and 
0(nd max ) respectively, where d max is the size of the basis vector set (d max <C n). 
Further, the costs of predictive mean and variance computations for an example 
are reduced from 0(n) and 0(n 2 ) to 0{d max ) and 0(d 2 nax ) respectively. 

In this work, we focus on developing an efficient Sparse Gaussian Process 
Classifier (SGPC) design algorithm. Several approaches have been proposed in 
the literature to design sparse GP classifiers. These include on-line GP learning 
[1. and entropy or information gain based Informative Vector Machine (IVM) 
[4 8 . Particularly relevant to this work is IVM which is inspired by the technique 
of assumed density filtering (ADF) [511] . In general, an SGPC design algorithm 
using the ADF approximation involves site parameter estimation, basis vector 
selection and hyperparameter optimization. While the site parameters are esti- 
mated using a moment matching technique in the ADF approximation, hyperpa- 
rameters are estimated by optimizing marginal likelihood or negative logarithm 
of predictive probability (NLP) [6 . Different methods to select the basis vectors 
include entropy, information gain and validation based methods [9] . Experimen- 
tal comparisons of the IVM with entropy based method and validation based 
method on various benchmark datasets showed that though the IVM method 
is efficient, it does not generalize well particularly on difficult datasets, and it 
requires more number of basis vectors to achieve similar generalization perfor- 
mance compared to the validation based method. Though the validation based 
method generalizes well, it is computationally expensive. Therefore, there is a 
need to have an efficient algorithm to design SGPCs that generalize well. 

Contributions: Viewing SGPC design as construction of an additive model 
(that is, a linear combination of basis functions) [3], a basis vector addition can 
be seen as adding a basis function in each iteration like in boosting 7 . With this 
view we introduce new methods to select the basis vectors and, estimate their 
site parameters by optimizing a predictive loss function. These estimated site 
parameters determine the coefficient of the basis function in the additive model. 
Further, an adaptive sampling based basis vector selection method is proposed, 
which aids in effective basis vector selection and computational cost reduction. 
The proposed basis vector selection method has same computational complexity 
as used by IVM. We also compare the generalization performance of various basis 
vector selection methods. Experimental results show that the proposed method 
gives comparable or better performance on a wide range of real-world large 
datasets. In particular, the proposed method is significantly better compared 
to the entropy and information gain based methods for relatively smaller d max 
values or on difficult datasets. 

The paper is organized as follows. Section 2 presents an SGPC design algo- 
rithm with the ADF approximation. The proposed methods and implementation 
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aspects are given in Section 3. Section 4 covers related work. Experimental re- 
sults are presented in Section 5 and the paper concludes with Section 6. 

2 GP and Sparse GP Classification 

Given a training data set with input-output pairs T> = {x;, where Xj € R d 

and i/i € {+1,-1}, the goal is to design a GP classifier that generalizes well. In 
standard GPs for classification [BJ, true function value at each x« is represented 
as a latent random variable /(xj). Let us denote /(Xj) by /^. The prior distribu- 
tion of {f(X„)} is a zero mean multivariate joint Gaussian, denoted as p(f) — 
Af(-; 0,K), where f = [f u . . . , f n ] T , X n — [xi , . . . , x n ] , and K is an n x n covari- 
ance matrix whose (i,j) th element is A^x^x^.An example covariance function 

is the squared exponential function: fc(xi,x 3 ) = vq exp(— ^ Y^L=i ' yX% ' m Ji J,T "^ ). 
Here, vq and <r 2 denote the signal variance and kernel width respectively. In 
this work we use the probit noise model, p(yi\fi, X,b) = <P(Xyi(fi + b)) where 
<P(-) is the cumulative distribution of the standard Gaussian Af(-; 0, 1) with zero 
mean and unit variance, the slope of which is controlled by A(>0) and 6 is a 
bias hyperparameter. With independent, identical distribution assumption, we 
have p(y|f,7) = II™=iP(j/»l/«;7) where 7 = [A, b}. Let 9 = [t; ,cr 2 ,7] denote 
the hyperparameters that characterize the GP model. With these modeling as- 
sumptions, the expressions for latent posterior and predictive distributions are 
available [BJ. In SGPC design using the ADF approximation [4 , a factorized 
form of <7 u (f|2?, 6) (given below) is made use of, to build an approximation to 
p({ \T>, 6) in an incremental fashion. Let u denote the index set of the training 
examples which are included in the approximation. Then we have 

q u (f \V, 0) ex 7V(f; 0, K) ]J exp {~(fi ~ ™,) 2 } (1) 

and p(f\V,0) fa q u (f\V,0) = 7V(f;f,A) where A = (K _1 + ny 1 and f = 
AiTm, m = (mi, . . . , m n ) T and II = diag(pi, . . . ,p n ). The parameters rrii and 
Pi, i = 1 — > n are called the site function parameters and the set u is called 
the active or basis vector set. Note that u is actually associated with the in- 
puts X u . We refer to u c = {1, 2, . . . , n} \ u as the non-active set. In practice, 
the active set size |u| is restricted by the user specified parameter, d max . Note 
that the site function parameters corresponding to u c are zero. Thus a SGPC 
model is defined by the basis vector set u, its associated site function parame- 
ters (m u , JT U ) and the hyperparameters 6. In general, SGPC design algorithms 
differ with respect to the basis vector selection, site parameter estimation and 
hyperparameters optimization methods. A typical SGPC design algorithm using 
the ADF approximation is given in Algorithm 1. 

We now briefly describe the ADF approximation method [4] to implement 
step 4. Suppose that an example index j is added to the current basis vector set 
u. Let xij = uU {j}. After updating the site function parameters pj and rrij, in- 
cremental calculations are carried out to update f and diag(A) corresponding to 
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Algorithm 1 SGPC Design 



1. Initialize the hyperparameters 0. Set dmax, tol, iter max and, iter=0. 
repeat 

2. Initialize A := K, u = {}, u c = {1, 2, . . . , n}, fi = pi = rrn = V i G u c . 

iter — iter + 1. 

repeat 

3. Select a basis vector j from u c as per the chosen basis vector selection method. 

4. Update the site parameters Pj, m.j, posterior mean (f) and variance 
(diag(A)). 

5. Set u = u U {j} and u c = u c \ {j}. 
until |u| = d max 

6. Re-estimate the hyperparameters by optimizing a suitable loss function, keep- 
ing u and the corresponding site parameters constant. 
until iter — itermax or change in the loss function value < tol 



Uj . This is achieved by maintaining two matrices L and M where L is the lower- 
triangular Cholesky factor of B = 1 + JT^K u , u i7^ and M = L^JIu^K^I 
Note that A = K — M T M. However, only the diagonal elements of A are needed 
in the algorithm and are updated as given in ([3]) below. Assuming A = 1, with 

Zj = M4g, a 3 = Vj = a, (a, + fef) the site function 

parameters are updated as: 

i> r^hr- f' V- (2) 



Let 1 = yfJM.j, I = x T~~p~K~, FT. ii = I '■■.^JTJK.., - M T 1). Then M 
is updated by appending the row vector fi T and L is updated by appending [L 0] 
with [l T I]. The posterior variance and mean are updated as: 

diag(A) := diag(A) - /x 2 , f := f + aj lpJ 1/2 fi. (3) 

In ([3]), fi 2 denotes squaring of each element in fi. In the outer loop the hy- 
perparameters are optimized by maximizing the marginal likelihood (ML) [3], 
<7„(y|X, 0) = Jp(y|f,7)g u (f \T>, 6)df or minimizing the negative logarithm of 
predictive probability (NLP) loss (under cumulative Gaussian noise model) [9], 



' ' iGu c \ 



M + b) 



(4) 



Finally the predictive target distribution for an unseen input cc* is given by: 
<7u(y*|x») = $ ( j'ji'^J'J ^j wnere /* = k* )U il2B _1 .7Tj!m u and er 2 = fc(x*,x*) - 

3 

The subscript, (u, u), of a matrix is used to represent the rows and columns of the matrix corresponding to the 
elements of the set u. The subscript, (u, .) denotes the rows of the matrix corresponding to the elements of the 
set ii. 



An Additive Model View to Sparse Gaussian Process Classifier Design 



5 



k* :U i7uB _1 i7uk Ui *. In the next section we propose new methods for effective 
basis vector selection and site parameters optimization (steps 3 and 4 in Algo- 
rithm 1). 



3 Proposed Methods 

Friedman et al [5] showed how boosting [7] can be seen as a way of fitting an 
additive model, /m(x) = Em=i w m ^( x i ^m) where w m , m = 1,2, ...,M are 
the expansion coefficients, and tp(x.; S m ) € TZ are the basis functions character- 
ized by the parameters 6 m , m — 1,2,...,M, and M is the number of basis 
functions (M — d max ). This model is fit by minimizing a loss function aver- 
aged over the training data, that is: min^ Wm Sm yM Yli=i ex P(~yifM(^i)) where 
an exponential loss function is used. In forward stagewise additive modeling 
the basis functions are added one at a time and, the coefficient and the basis 
function parameter (w rn ,S m ) are optimized by keeping the coefficients and pa- 
rameters of the previously chosen basis functions constant. That is, {w m ,S m } = 
argm.m w ,s e%p(—yifm-i( x i) + w4>(x;5)),m > 1. Friedman et al [3] also pre- 
sented a related loss function that is based on the binomial likelihood, given by: 
E"=i iogC 1 + exp(-2yif M (xi))). 

With this view, we consider the SGPC design as constructing a forward 
stagewise additive model. Before we show the equivalences between the selection 
of a basis function and its coefficient to the selection of a basis vector (j) and its 
site parameters (pj,rrij), we define an objective function (called predictive loss 
function) that we propose to use to select a basis function and its coefficient in 
each iteration of the SGPC design algorithm: 

where j € u c and, /j, are computed using {u U j}. This objective function 
has a behavior similar to the exponential and binomial likelihood loss functions 
mentioned above and, is also an upper bound on the training set error. That is, 
we have: 

-\{i:sgn(f( Xl )+b)^ yi }\ < ^_NLP a (u,0) (6) 
71 log(2) 

Here the left hand side represents the training set error. The inequality follows 
from noting that < $(z) < 1, ^(0) = 0.5 and that — log(#(z)) monotonically 
decreases in the interval (— oo, oo). Note that log(^(0)) = — log(2) and is required 
for appropriate scaling so that — — ^ w hen z < 0. Thus ((SJ is an upper 

bound on the training set error. 

Comparison of Objective Functions: Firstly, unlike the exponential loss 
function used in boosting, the function #(•) is not separable. That is, the linear 
combination of basis functions that appear inside <P(-) cannot be written as a 
product of individual terms. This separability property of the exponential func- 
tion is useful for the interpretation of building successive weak classifiers on the 



6 Sundararajan and Shirish 



training data with weighted distribution. However, keeping all the previous ba- 
sis functions with the associated coefficients fixed and, optimizing over only an 
additional basis function along with its coefficient using (J5]) essentially has the 
same desirable effect. It may be noted that like (|5|), the binomial log likelihood 
is not separable in strict sense (without any approximation). Secondly, the GP 
classifier has the advantage of providing predictive variance information which 
is useful in moderating the predictive probability. Specifically when the uncer- 
tainty or variance is large, this probability gets reduced accordingly. This is very 
important particularly when the data points are sparse in a certain region of the 
input space or when the data is noisy. Thus, use of (JS|) would be more robust. 
The behaviors of — log(#(-)) with and without moderation along with the other 
loss functions are shown in Figure [1] 




Fig. 1. Exponential, binomial log likelihood, — l °^^2)^ functions with and without moderation. 
In the moderation case, the variance was set to 0.5. Zero variance corresponds to no moderation. A 
reference function that takes unit value is also shown. 

Forward Stagewise Additive Model View: We now show using that the 
SGPC design using Algorithm 1 with ((5]) as the objective function (to select the 
basis vectors and their coefficients) is equivalent to building a forward stagewise 
additive model. In particular, a basis vector selection results in a basis function 
choice and the coefficient optimization essentially results in its site parameters 
estimation in each iteration (steps 3 and 4 in Algorithm 1). Note that the notions 
of stage and iteration in Algorithm 1 are equivalent. First, let us look at the steps 
3 and 4 of Algorithm 1 more closely. After selecting a basis vector j and updating 
its site parameters (pj , rrij ) at the t-th iteration, the following posterior variance 
and mean update can be obtained by simplifying @: 

diag(A)(* +1 ) := diag(A)W - % k 2 j5 f(*+i) : = f« + „,k.., (7) 

i i 

where k j = (k j — k Ut JTujB^iT 5 k Ut .) and u t is the basis vector set at 
the t-th iteration. Here r)j = — Pj A ( t > and ctj = f]j( m j ~ fj )■ Note that 

Vj > 0- Then the process of adding the jth basis vector is equivalent to adding 
a basis function k(x, Xj). That is, we can define the additive model function 
for SGPC as: / (t+1) (x) = / (t) (x) + <5 J fc(x,x i ). Here, fc(x,x 3 ) = fc( x , x j) - 

1 1 

k(x, x Ut )iTS t B~ t 1 iT^k(x Ut ,x) (where k(x,x U£ ) is a row vector of size | u t | ) , and 
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is dependent on the input through fc(x, x 3 ), the previously chosen functions 
and their site parameters. Note that in both the ADF approximation and the 
proposed methods, the site parameters of the previously selected basis vectors 
are not updated whenever a new basis vector is added. This is done to reduce the 
computational complexity. Next, we can see that the choice of ctj is dependent 

on the site parameters rrij andpj. This is because fj and A^-*- are fixed once the 
jth basis vector is chosen. Now, relating /(* +1 )(x) to the predictive mean vector 
in (JT]), we see that the predictive mean vector is nothing but the evaluation of 
the function /( t+1 )(x) for the training inputs x^,i = l,...,n. Therefore, selec- 
tion of the jth basis vector and estimation of its site parameters (pj , rrij ) in each 
iteration (stage) of the SGPC design algorithm essentially determine the basis 
function fc(x, Xj) and its coefficient ctj. To summarize, we have the final classifier 
function (excluding the bias hyperparameter b) and the predictive variance on 
an input x as: 



/(x)= £ aifc(x )Xi ) (8) 

i=l 

dm.. 

a 2 (x) = fc(x,x) - »7fA(x,Xi) (9) 
»=i 

Note that the expression for <t 2 (x) follows from the expression for diag(A)(* +1 ) 
on the left hand side of ([7]). It is interesting to see that the variance is a non- 
increasing function as more and more basis functions are added. Having shown 
the equivalence, we next show how the jth basis function and the associated 
coefficient ctj can be obtained by optimizing ((5|) in each iteration. As we have 
seen before, the choice of a basis vector determines the basis function and we 
describe next how this selection is done. 

Basis Vector Selection Method: From efficiency viewpoint, we propose to 
select a basis vector as: 

j = argminNLP Q ({uU i},9). (10) 

where J, a working set, is a randomly chosen subset of u c , | J|=min(K,|u c |) and n 
can be set to 59 [10] . To select one basis vector using (|T0| the computational cost 
is 0(Knd m ax)- Therefore a method to reduce the factor n without significantly 
degrading generalization performance will be very useful. We achieve this by 
changing the sampling strategy (from random sampling) used to construct the 
working set J. In the proposed adaptive sampling technique, we construct J by 
sampling from u c according to a distribution that changes after a basis vector 
is added in each iteration. The sampling distribution is given by: 

vm = J-(<i-*( y s$ t)+b) )) an 



8 Sundararajan and Shirish 



where V® is a normalizing constant. Here, f • and are computed using the 
basis vectors in u t . Since / and A change after inclusion of every basis vector in 
the inner loop, the distribution also changes and the sampling becomes adaptive. 

To understand why such a sampling along with ([10[) would be useful, we 
can see that if — > 1 (for a correctly classified example with high predictive 
probability), then the probability of selecting such an example as a basis vec- 
tor will be relatively small. On the other hand, the probability of selecting a 
misclassified example with low predictive probability (that is, ^(-) — > 0) will be 
relatively high. We found that selecting the most violated example (that is, the 
example with the least <P(-) in u c ) in each iteration results in poor basis vector 
selection for noisy and difficult datasets. The adaptive sampling technique can 
safeguard against such a selection and is robust across different datasets. Next, 
the sign of &j in ([7]) gets adjusted in such a way that f moves in the desired 
direction for a given k j. This desired movement is expected to happen for all 
the examples having same class label that are close enough to the jth example. 
Therefore, with a choice of an example (having low value of ^>(-)), f( t+1 ) moving 
in the desired direction and variance diag(A.)( t+1 ^ non-increasing, we expect the 
NLP value in ([5]) to improve particularly for the examples with wrong predic- 
tions or low predictive probability. In this sense the basis vector selection using 
(fll))) and pT|) tends to mimic the selection of a base classifier in boosting [7J that 
minimizes the training set error with weighted distribution. This helps in getting 
a better generalization performance for a fixed k compared to random sampling. 
Alternatively, k can be reduced to get the same generalization performance. 
Experimental results support these claims. 

Site Parameters Optimization Method: Having constructed the working 
set J using the adaptive sampling technique, we optimize ([S]) to find on for 
each basis vector i G J. As shown earlier optimizing over at is equivalent to 
optimizing over the site parameters rrii and pi for a given basis vector. Essentially 
we have a two dimensional (rrii,Pi) non-linear optimization problem. Note that 
it is a constrained optimization problem (under certain condition given below) 
since the posterior variance diag(A)^ t+1 ^ should be non-negative after every 

iteration. Assuming that diag{A)^ ts> is non-negative it turns out that pi must 

A (*) 

satisfy: fji> r]i — ^ P ' A m wnere Vi — mini O n further simplification we 

find that if JftAvP > 1 then we have an unconstrained optimization problem (in 
when we work with pi = exp(r.i)); otherwise we have a constrained optimization 
problem with < pi < — r^Hrr ■ This can be solved using any standard nonlinear 

optimization technique. To summarize, we construct J using adaptive sampling, 
optimize cti, Vi € J and select the basis vector using U0\) . 

4 Related work 

In this section we briefly describe three closely related methods that we compare 
with the proposed method. In entropy based method [1], a basis vector is chosen 



An Additive Model View to Sparse Gaussian Process Classifier Design 



9 









40 




40 




40 






= ITOI 




t 30 




t 30 




L 


t.Set I 




.Set 1 




LSet 1 






I- 10 




*~ 10 




1- 10 






















50 100 150 50 100 150 50 100 150 50 100 150 





a. 0.5 






I 


























Z 0.4 




J 0.4 






1 0.4 










^^^^^ 








M 










g 0.3 


















0.2 




H 0.2 






0.2 













50 100 150 50 100 150 50 100 150 50 100 150 



Fig. 2. The left panel (group of 4 plots) corresponds to Waveform dataset and the right panel 
corresponds to Image dataset. The first and second rows show the training/test set errors and NLP 
loss as the basis vectors are added in the inner loop just before termination. The solid-red and 
dashed-blue lines correspond to a with moment matching and constrained optimization cases with 
ADF approximation. In this experiment we set k — 2. 

according to the change in the entropy of the posterior process ([T]) after inclusion 
in the model and is given by: j — argminig u c log(Ai) where Aj = 1 — fiA-a. In 
information gain based method [5] , a basis vector is chosen according to the in- 
formation gain (which is defined as negative of the Kullback-Leibler divergence) 
obtained from the posterior process after inclusion in the model and is given by: 

j = argmin ieu c{— log(Ai) + p + ^a-^ }■ Here /j and f- denote the predictive 
mean before and after the inclusion of the ith basis vector. Compared to the 
entropy based selection, this method takes the predictive mean also into account 
and differs from the way Xi is traded-off between the first and second term. Both 
these methods are very efficient since the relevant quantities that are needed to 
compute the appropriate measure for the basis vector selection are maintained 
throughout in the inner loop of Algorithm 1. Both these methods maximize the 
marginal likelihood for hyperparameter optimization. 

In validation based method [9,, a working set J C u c of fixed size k (k = 
min(|u c |, 59)) is constructed by sampling randomly from u c . The basis vector 
that minimizes (j4]) is chosen and this involves computation of a new NLP value 
after inclusion for each i 6 J. Thus the computational cost for one basis vector 
selection is 0{nnd max ). The hyperparameters are selected by minimizing 

While all the three methods use moment matching with the ADF approxima- 
tion to estimate the site parameters ([2]) , the proposed site parameters optimiza- 
tion method provides an alternate way to estimate these parameters. Note that 
one can also use ^ in conjunction with the proposed adaptive sampling based 
basis vector selection method. On comparing the objective functions used by the 
proposed method and validation based method, we see that the form of (U]) is 
same that of §5§ except that the summation happens only over u c . While the 
validation based method viewed (HJ) as obtaining the NLP performance estimate 
with a validation set, ([SJ is motivated from the additive modeling viewpoint 
and, minimizing an upper bound on the training set error. Furthermore, the 
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validation based method uses fixed uniform sampling instead of adaptive sam- 
pling. Note that the difference between (jU) and (JSJ is expected to be insignificant 
when dmax <C n (usually the case in SGPC design with large datasets) and this 
condition is important to avoid any overfitting. 

5 Experiments 

The summary of the datasets used in the experiments is given in Table 1. 
These datasets are part of Gunnar Raetsch's benchmark datasets available at 
http : //theoval . cmp .uea. ac .uk/~gcc/mat lab/default .html. We changed the 
training and test set sizes of the top five datasets in Table 1 to demonstrate the 
effectiveness of the proposed method on large datasets. For the first four datasets 
we picked top 3600 test examples from the original test set partition and added 
to the training set. The remaining examples were used as the test set. Note that 
this construction however results in reduction of the test set size. In the case 
of Splice dataset, we picked the top 1000 examples from the test set partition. 
The modified train and test set sizes are shown Table 1. We considered only the 
first 25 partitions of the first four datasets. In all the experiments we used the 
squared exponential covariance function and Algorithm 1 described in Section 2. 
A conjugate gradient method was used to optimize (U) (unless otherwise speci- 
fied) in the outer loop for optimizing the hyperparameters, and iter max was set 
to 20. We kept track of the best model based on the NLP loss value after every 
outer loop iteration. For comparison, we evaluated the test set error and NLP 
loss performance. 

Table 1. Datasets Description, n and m denote the training and test set sizes, d and pt denote 
the input dimension and number of partitions. 



Dataset 


n 


m 


d 


pt 


Banana 


4000 


1300 


2 


25 


Waveform 


4000 


1000 


21 


25 


Twonorm 


4000 


3400 


20 


25 


Ringnorm 


4000 


3400 


20 


25 


Splice 


2000 


1175 


60 


20 


Image 


1300 


1010 


18 


20 



We conducted three experiments. Due to the space constraints we present 
only selected results. In the first experiment we illustrate the effectiveness of the 
proposed method of site parameters (equivalently, a) optimization. The results 
on one partition of the Waveform and Image datasets are shown in Figure 2. 
This method is compared against using (f2J) for site parameters optimization. 
Although some minor variations were seen between the two methods, statistical 
analysis showed that the performance differences were not significant. Thus, 
the constrained optimization is an effective alternate method to estimate the 
site parameters. We now discuss certain practical aspects of this optimization. 
During optimization, the variance can become zero (within numerical accuracy) , 
for some choice of the hyperparameter values and, also due to the greedy nature 
of the basis vector selection method. While this can be handled in some way 
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(for example by exiting the inner loop), optimizing over individual a^s can 
become slightly expensive for large datasets. Note that the function and gradient 
computations are linear in n. We can control the optimization cost by restricting 
the number of function and gradient evaluations with some inaccuracy in the 
solution. Therefore, the proposed optimization is also efficient. 

In the next two experiments, we kept the site parameter estimation (using 
and the hyperparameter estimation (using ([%])) same, and only changed the 
basis vector selection method in the step 3 of Algorithm 1. This is because our 
goal here is to compare the quality of the different basis vector selection methods. 
First, we demonstrate the effectiveness of the adaptive sampling method in the 
basis vector selection. This is done by comparing it with random (uniform) 
sampling method. We conducted this experiment on all the datasets given in 
Table 1. The test set error and NLP loss performance results on two datasets are 
given in Figure [3] (left panel) for two different values of d max . These results were 
obtained by averaging the performance over the partitions. We found that the 
adaptive sampling method consistently performed better across all the datasets, 
particularly with respect to the NLP loss measure. This is because the choice 
of the basis vectors made by the adaptive sampling method is based on the 
predictive distribution. We also observed improved test set error performance 
on several cases. It was also observed that the performance difference reduces 
as the working set size n increases. It can also be seen that k value of 2 is 
sufficient for the adaptive sampling method to get similar NLP generalization 
performance as the validation based method (see the second column in the left 
panel of Figure [3]) • 

In the third experiment, we compared the performance of the proposed 
method, validation based method, entropy and information gain based basis 
vector selection methods. In the case of proposed method, we evaluated the per- 
formance with k — 1 and 2, thus ensuring that the complexity for the basis 
vector selection is the same as that of the entropy and information gain based 
methods. We conducted this experiment for four different values of d max (40, 80, 
160 and 320) on all the datasets given in Table 1. The test set error and NLP 
loss performance on three datasets are shown in Figure [3] (right panel) . They 
were obtained by averaging the performance over the partitions. We compared 
the performance of various methods using statistical significance tests. We first 
conducted Wilcoxon test on the test set error and NLP loss obtained from the 
partitions, on each dataset. All the observations from the tests below are made 
at the significance level of 0.05. The results indicated better test set error and 
NLP performance of the proposed method over the entropy and information 
gain based methods on almost all the datasets. Specifically we observed that the 
proposed method performed better on difficult datasets (relatively higher test 
set errors) like Banana, Waveform and Splice for all values of d max with respect 
to (w.r.t.) both the measures. On Twonorm and Ringnorm datasets it performed 
better w.r.t. the NLP loss measure for all the values of d max . While it performed 
better than the entropy based method on the Ringnorm dataset for all values of 
dmax w.r.t. the test set error, the performance was the same at higher values of 
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dmax in other cases. The information gain based method performed better than 
the entropy based method on the Banana, Waveform and Ringnorm datasets. The 
entropy based method performed better than the information gain based method 
w.r.t. the test set error in the case of Twonorm dataset. We observed that the 
entropy based method performed better than all the methods at lower values of 
d m ax on the Image dataset. On comparing the proposed method (k = 1 and 2) 
with the validation based method, we found that the validation based method 
performed better w.r.t. the test set error at lower values of d max (40 and 80). 

Next, following [2], we conducted Friedman's test with six datasets (Table 1) 
and four methods, namely, the proposed method (with K=2), validation, entropy 
and information gain based methods. To conduct this test, we used the average 
test set error and NLP values obtained from averaging over the partitions. The 
p-values obtained for the test set error and NLP measure were (0.02, 0.04, 0.04, 
0.39) and (0.002, 0.002, 0.01, 0.09) respectively for four different values of d max 
(40, 80, 160 and 320) in that order. When d max was 320, the results were not 
significantly different w.r.t. both the measures. Since the null hypothesis was re- 
jected for d max values of 40, 80 and 160, we next conducted the Bonferroni-Dunn 
post-hoc test to compare the proposed method with the other three methods. 
This test revealed that there were no significant differences between the proposed 
and validation based methods for all values of d max w.r.t. both the measures. On 
comparing the proposed method with the entropy and information gain based 
methods, we found that while the results were not significantly different w.r.t. 
the test set error, they were significant w.r.t. the NLP measure for lower d max 
values at 0.1 level. Overall, it was seen that the p- value became larger and the 
performance differences across the methods reduced as d max was increased. 

Except for the validation based method (k = 59), all the methods required al- 
most the same computational time for the basis vector selection. An approximate 
timing measurement of one inner loop (for d max —80) showed that the proposed 
method with k — 1 took approximately 20 seconds for the Banana dataset (on a 
machine with 2 GB of RAM and dual core Intel CPU running at 1.83 GHz). In 
general, we found that the proposed method was 5 times faster than the valida- 
tion based method on almost all the datasets. This comparison was based on the 
Matlab implementations of these methods. The speed improvement was not as 
high as 59. We believe that efficient matrix based operations in Matlab helped 
the validation based method significantly and, expect the speed improvement to 
be higher with implementations in other programming languages like C. 

6 Conclusion 

We considered the problem of designing an SGPC from an additive model es- 
timator viewpoint. We introduced new methods for basis vector selection and 
site parameters estimation based on the predictive loss function. An adaptive 
sampling method that aids in effective basis vector selection and computational 
complexity reduction was proposed. The proposed basis vector selection method 
has same computational and storage complexities as that used by IVM and, is 
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thus suitable for large datasets. The experimental results showed better gener- 
alization performance of the proposed method on several benchmark datasets, 
particularly for relatively smaller d max values or on difficult datasets. 
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Fig. 3. Left Panel of eight plots: Test set error and NLP loss performance of the random 
sampling (dashed-red-square) and adaptive sampling (solid-blue-circle) methods for different values 
of k. The dashcd-dot-black line corresponds to the validation based method with k— 59. Top two rows 
correspond to Waveform dataset for d max — 40 and 80 (in that order). The bottom rows correspond 
to Twonorm dataset for t/ maa; — 40 and 80. Right Panel of six plots: Test set performance of the 
various basis vector selection methods (entropy, information-gain, proposed method with k— 1 and 
2, and validation based method (k— 59) (different gray shades) in that order) for different values of 
dmax (40, 80, 160 and 320 correspond to the x-axis values of 1, 2, 3 and 4 respectively). Each row 
corresponds to one dataset. The results on Banana, Waveform and Twonorm datasets are given in that 
order. 
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